Anomaly Detection

This invention relates to anomaly detection, and to a method, an apparatus and
computer software for implementing it. More particularly, although not exclusively, it
relates to detection of fraud in areas such as telecommunications and retail sales by
searching for anomalies in data.

It is known to detect fraud with the aid of fraud management systems which use
hand-crafted rules to characterise fraudulent behaviour. The rules are generated by human
experts in fraud, who supply and update them for use in fraud management systems in
detecting fraud. The need for human experts to generate rules is undesirable because it
is onerous, particularly if the number of possible rules is large or changing at a significant
rate.

It is also known to avoid the need for human experts to generate rules; i.e. artificial
neural networks are known which learn to characterise fraud automatically by processing
training data. They then detect fraud indicated in other data from characteristics so
learned. However, neural networks characterise fraud in a way that is not immediately
visible to a user and does not readily translate into recognisable rules. It is important to
be able to characterise fraud in terms of breaking of acceptable rules, so this aspect of
neural networks is a disadvantage.

Known rule-based fraud management systems can detect well-known types of fraud
because experts know how to construct appropriate rules. In particular, fraud over
circuit-switching networks is well understood and can be dealt with in this way. However,
telecommunications technology has changed in recent years with circuit-switching
networks being replaced by Internet protocol packet-switching networks, which can
transmit voice and Internet protocol data over telecommunications systems. Fraud
associated with Internet protocol packet-switching networks is more complex than that
associated with circuit-switching networks: this is because in the Internet case, fraud can
manifest itself at a number of points on a network, and experts are still learning about the
potential for new types of fraud. Characterising complex types of fraud manually from
huge volumes of data is a major task. As telecommunications traffic across
packet-switching networks increases, it becomes progressively more difficult to characterise
and detect fraud.

The present invention provides a method of anomaly detection characterised in that it
incorporates the steps of:

a) developing a rule set of at least one anomaly characterisation rule from a training
data set and any available relevant background knowledge, a rule covering a
proportion of positive anomaly examples of data in the training data set, and

b) applying the rule set to test data for anomaly detection therein.

The method of the invention provides the advantage that it obtains rules from data, not
human experts, and the rules are not invisible to a user.

Data samples in the training data set may have characters indicating whether or not they
are associated with anomalies. The invention may be a method of detecting
telecommunications or retail fraud from anomalous data and may employ inductive logic
programming to develop the rule set.

Each rule may have a form that an anomaly is detected or otherwise by application of the
rule according to whether or not a condition set of at least one condition associated with
the rule is fulfilled. A rule may be developed by refining a most general rule by at least
one of:

a) addition of a new condition to the condition set;

b) unification of different variables to become constants or structured terms in conditions.

A variable in a rule which is defined as being in constant mode and is numerical is at
least partly evaluated by providing a range of values for the variable, estimating an
accuracy for each value and selecting a value having optimum accuracy. The range of
values may be a first range with values which are relatively widely spaced, a single
optimum accuracy value being obtained for the variable, and the method including
selecting a second and relatively narrowly spaced range of values in the optimum
accuracy value's vicinity, estimating an accuracy for each value in the second range and
selecting a value in the second range having optimum accuracy.

The method may include filtering to remove duplicates of rules and equivalents of rules,
i.e. rules having like but differently ordered conditions compared to another rule, and
rules which have conditions which are symmetric compared to those of another rule. It
may include filtering to remove unnecessary 'less than or equal to' ('lteq') conditions.
Unnecessary 'lteq' conditions may be associated with at least one of ends of intervals,
multiple lteq predicates, and equality condition and lteq duplication.

The method may include implementing an encoding length restriction to avoid overfitting
noisy data by rejecting a rule refinement if the refinement encoding cost in number of bits
exceeds a cost of encoding the positive examples covered by the refinement.

Rule construction may stop if at least one of three stopping criteria is fulfilled as follows:

a) the number of conditions in any rule in a beam of rules being processed is greater
than or equal to a prearranged maximum rule length,

b) no negative examples are covered by a most significant rule, which is a rule that:

i) is present in a beam currently being or having been processed,

ii) is significant,

iii) has obtained a highest likelihood ratio statistic value found so far, and

iv) has obtained an accuracy value greater than a most general rule accuracy
value, and

c) no refinements were produced which were eligible to enter the beam currently
being processed in a most recent refinement processing step (32).

A most significant rule may be added to a list of derived rules and positive examples
covered by the most significant rule may be removed from the training data set.

The method may include:

a) selecting rules which have not met rule construction stopping criteria,

b) selecting a subset of refinements of the selected rules associated with accuracy
estimate scores higher than those of other refinements of the selected rules, and

c) iterating a rule refinement, filtering and evaluation procedure (32 to 38) to identify any
refined rule usable to test data.

In another aspect, the present invention provides computer apparatus for anomaly
detection characterised in that it is programmed to execute the steps of:

a) developing a rule set of at least one anomaly characterisation rule from a training
data set and any available relevant background knowledge, a rule covering a
proportion of positive anomaly examples of data in the training data set, and

b) applying the rule set to test data for anomaly detection therein.

In a further aspect, the present invention provides computer software for use in anomaly
detection characterised in that it incorporates instructions for controlling computer
apparatus to execute the steps of:

a) developing a rule set of at least one anomaly characterisation rule from a training
data set and any available relevant background knowledge, a rule covering a
proportion of positive anomaly examples of data in the training data set, and

b) applying the rule set to test data for anomaly detection therein.

The computer apparatus and computer software aspects of the invention may have
preferred features equivalent mutatis mutandis to those of the method aspect.

In order that the invention might be more fully understood, an embodiment thereof will
now be described, by way of example only, with reference to the accompanying
drawings, in which:

Figure 1 is a flow diagram illustrating a procedure for characterisation of fraudulent
transactions in accordance with the invention; and

Figure 2 is another flow diagram illustrating generation of a rule set for use in
characterisation of fraudulent transactions in the Figure 1 procedure.

One example of an application of anomaly detection using the invention concerns
characterisation of retail fraud committed in shops by cashiers. The invention in this
example may be used in conjunction with current commercial systems that can measure
and record the amount of money put into and taken out of cashiers' tills. Various kinds of
cashier behaviour may indicate fraudulent or suspicious activity.

In this example of the invention transactions from a number of different cashiers' tills
were employed. Each transaction was described by a number of attributes including
cashier identity, date and time of transaction, transaction type (e.g. cash or non-cash)
and an expected and an actual amount of cash in a till before and after a transaction.
Each transaction is labelled with a single Boolean attribute which indicates "true" if the
transaction is known or suspected to be fraudulent and "false" otherwise. Without
access to retail fraud experts, definitions of background knowledge were generated in
the form of concepts or functions relating to data attributes. One such function calculated
a number of transactions handled by a specified cashier and having a discrepancy: here
a discrepancy is a difference in value between actual and expected amounts of cash in
the till before and after a single transaction.
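
A background knowledge function of the kind just described might be sketched as follows. This is a minimal Python illustration and not the Prolog of the embodiment: the `Transaction` record and its field names are hypothetical, and for brevity only the amounts of cash after the transaction are compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    # Hypothetical record; field names are assumptions made for illustration.
    cashier: str
    expected_cash: float  # expected amount of cash in the till after the transaction
    actual_cash: float    # actual amount of cash in the till after the transaction


def has_discrepancy(transaction: Transaction) -> bool:
    """A discrepancy is any difference between actual and expected cash."""
    return transaction.actual_cash != transaction.expected_cash


def discrepancy_count(transactions, cashier: str) -> int:
    """Number of transactions handled by `cashier` that show a discrepancy."""
    return sum(1 for t in transactions if t.cashier == cashier and has_discrepancy(t))
```

A rule condition could then test, for example, whether `discrepancy_count` for a cashier exceeds some constant.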

In this example, the process of the invention derives rules from a training data set and
the definitions of basic concepts or functions associated with data attributes previously
mentioned. It evaluates the rules using a test data set and prunes them if necessary. The
rules so derived may be sent to an expert for verification or loaded directly into a fraud
management system for use in fraud detection. To detect fraud, the fraud management
system reads data defining new events and transactions to determine whether they are
described by the derived rules or not. When an event or transaction is described by a
rule then an alert may be given or a report produced to explain why the event was
flagged up as potentially fraudulent. The fraud management system will be specific to a
fraud application.
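
The detection step can be illustrated by a short Python sketch. It assumes, purely for illustration, that each derived rule is held as a named list of condition functions and that a transaction is a dictionary of attributes; the names `explain_alerts`, `large_refund`, `type` and `amount` are hypothetical.

```python
def explain_alerts(transaction, rule_set):
    """Return the names of all rules whose conditions all hold for the transaction.

    `rule_set` maps a rule name to a list of condition functions; a non-empty
    result means the transaction is described by at least one rule, and the
    names returned explain why it was flagged.
    """
    return [
        name
        for name, conditions in rule_set.items()
        if all(condition(transaction) for condition in conditions)
    ]


# Hypothetical rule: IF {type is refund, amount > 100} THEN {behaviour is fraudulent}
example_rule_set = {
    "large_refund": [
        lambda t: t["type"] == "refund",
        lambda t: t["amount"] > 100,
    ],
}
```

An empty list returned for a transaction means that no derived rule describes it, so no alert is raised.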

Benefits of applying the invention to characterisation of telecommunications and retail
fraud comprise:

• characterisations in the form of rule sets may be learnt automatically (rather
than manually as in the prior art) from training data and any available
background knowledge or rules contributed by experts; this reduces costs
and duration of the characterisation process;

• rule sets which are generated by this process are human readable and are
readily assessable by human experts prior to deployment within a fraud
management system; and

• the process may employ relational data, which is common in particular
applications of the invention; consequently facts and transactions which are
in different locations and which are associated can be linked together.

The process of the invention employs inductive logic programming software implemented
in a logic programming language called Prolog. This process has an objective of creating
a set of rules that characterises a particular concept, the set often being called a concept
description. A target concept description in this example is a characterisation of
fraudulent behaviour to enable prediction of whether an event or transaction is fraudulent
or not. The set of rules should be applicable to a new, previously unseen and unlabelled
transaction and be capable of indicating accurately whether it is fraudulent or not.

A concept is described by data which in this example is a database of events or
transactions that have individual labels indicating whether they are fraudulent or
non-fraudulent. A label is a Boolean value, 1 or 0, indicating whether a particular event or
transaction is fraudulent (1) or not (0). Transactions labelled as fraudulent are referred
to as positive examples of the target concept and those labelled as non-fraudulent are
referred to as negative examples of the target concept.

In addition to receiving labelled event/transactional data, the inductive logic programming
software may receive input of further information, i.e. concepts, facts of interest or
functions that can be used to calculate values of interest, e.g. facts about customers and
their accounts and a function that can be used to calculate an average monthly bill of a
given customer. As previously mentioned, this further information is known as
background knowledge, and is normally obtained from an expert in the relevant type of
fraud.

As a precursor to generating a rule set, before learning takes place, the labelled
event/transactional data is randomly distributed into two non-overlapping subsets - a
training data set and a test data set. Here non-overlapping means no data item is
common to both subsets. A characterisation or set of rules is generated using the
training data set. The set of rules is then evaluated on the test data set by comparing
the actual fraudulent or otherwise label of each event/transaction with the equivalent
predicted for it by the inductive logic programming software. This gives a value for
prediction accuracy - the percentage of correctly assessed transactions in the test data
set. Testing on a different data set of hitherto unseen examples, i.e. a set other than the
training data set, is a good indicator of the validity of the rule set.
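
The split and the accuracy measure just described can be sketched in Python as below. The function names, the 30% test fraction and the fixed random seed are assumptions chosen for illustration; examples are taken to be (features, label) pairs.

```python
import random


def split_examples(examples, test_fraction=0.3, seed=0):
    """Randomly distribute labelled examples into non-overlapping training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Each example lands in exactly one of the two subsets.
    return shuffled[n_test:], shuffled[:n_test]


def prediction_accuracy(predict, test_set):
    """Percentage of test examples whose predicted label equals the actual label."""
    correct = sum(1 for features, label in test_set if predict(features) == label)
    return 100.0 * correct / len(test_set)
```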

The target concept description is a set of rules in which each rule covers or characterises
a proportion of the positive (fraudulent) examples of data but none of the negative
(non-fraudulent) examples. It is obtained by repeatedly generating individual rules. When a
rule is generated, positive examples which it covers are removed from the training data
set. The process then iterates by generating successive rules using unremoved positive
examples, i.e. those still remaining in the training data set. After each iteration, positive
examples covered by the rule most recently generated are removed. The process
continues until there are too few positive examples remaining to allow another rule to be
generated. This is known as the sequential covering approach, and is described in
Machine Learning, T. Mitchell, McGraw-Hill, 1997.
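
The sequential covering loop can be summarised by the following Python sketch. The single-rule learner `learn_one_rule` is a caller-supplied placeholder (an assumption here; in the embodiment it corresponds to the rule search described with reference to Figure 2), and a rule is represented simply as a function returning true for the examples it covers.

```python
def sequential_covering(positives, negatives, learn_one_rule, min_positives=1):
    """Accumulate rules, removing the positive examples each new rule covers."""
    rules = []
    remaining = list(positives)
    while len(remaining) >= min_positives:
        rule = learn_one_rule(remaining, negatives)
        if rule is None:
            break  # the learner could not produce another rule
        if not any(rule(p) for p in remaining):
            break  # a rule covering nothing would make no progress
        rules.append(rule)
        # Keep only the positive examples not yet covered by any rule.
        remaining = [p for p in remaining if not rule(p)]
    return rules
```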

Referring to Figure 1, a process 10 involving applying the inductive logic programming
software (referred to as an ILP engine) at 12 to characterising fraudulent transactions is
as follows. Background knowledge 14 and a training data set 16 are input for processing
at 12 by the ILP engine, which processes them to produce a set of rules 18. Rule set
performance is evaluated at 20 using a test data set 22.

Referring now also to Figure 2, processing 12 to generate a set of rules is shown in more
detail. Individual rules have a form as follows:

IF {set of conditions} THEN {behaviour is fraudulent} (1)

A search for each individual rule begins at 30 with a most general rule (a rule with no
conditions): searching is iterative (as will be described later) and generates a succession
of rules, each new rule search beginning at 30. The most general rule is:

IF { } THEN target_predicate is true (2)

This most general rule is satisfied by all examples, both positive and negative, because it
means that all transactions and facts are fraudulent. It undergoes a process of
refinement to make it more useful. There are two ways of producing a refinement to a
rule as follows:

• addition of a new condition to the IF { } part of the rule;

• unification of different variables to become constants or structured terms.
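
The first kind of refinement can be illustrated with a small Python sketch in which a rule is held as a tuple of condition functions; this representation and the names used are assumptions made for illustration and are not the Prolog clauses of the embodiment.

```python
MOST_GENERAL_RULE = ()  # IF { } THEN target_predicate is true


def covers(rule, example):
    """A rule covers an example when every one of its conditions holds."""
    # all() of an empty sequence is True, so the most general rule covers everything.
    return all(condition(example) for condition in rule)


def add_condition(rule, condition):
    """Refine a rule by adding one new condition to its IF { } part."""
    return rule + (condition,)
```

Each added condition can only keep or shrink the set of examples covered, which is why refinement makes a rule more specific.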

Addition of a new condition and unification of different variables are standard
expressions for refinement operator types though their implementation may differ
between systems. A condition typically corresponds to a test on some quantity of
interest, and tests are often implemented using corresponding functions in the
background knowledge. When a new condition is added to a rule, its variables are
unified with those in the rest of the rule according to user-specified mode declarations.
Unification of a variable X to a variable Y means that all occurrences of X in the rule will
be replaced by Y. A mode declaration for a predicate specifies the type of each variable
and its mode. A variable mode may be input, output, or a constant. Only variables of the
same type can be unified. Abiding by mode rules reduces the number of refinements
that may be derived from a single rule and thus reduces the space of possible concept
descriptions and speeds up the learning process. There may be more than one way of
unifying a number of variables in a rule, in which case there will be more than one
refinement of the rule.

For example, a variable X may refer to a list of items. X could be unified to a constant
value [ ] which represents an empty list or to [Y|Z] which represents a non-empty list with
a first element variable Y and the rest of the list represented by another variable Z.
Instantiating X by such unification constrains its value. In the first case, X is a list with no
elements and in the second case it must be a non-empty list. Unification acts to refine
variables and rules that contain them.

Variables that are defined as being in constant mode must be instantiated by a constant
value. Variables of constant type can further be defined by the user as either
non-numerical or numerical constants.

If a constant is defined as non-numerical then a list of possible discrete values for the
constant must also be specified by a user in advance. For each possible value of the
constant, a new version of an associated refinement is created in which the value is
substituted in place of the corresponding variable. New refinements are evaluated using
an appropriate accuracy estimate and the refinement giving the best accuracy score is
recorded as the refinement of the original rule.

If a constant is specified as numerical, it can be further defined as either an integer or a
floating-point number. A method for calculating a best constant in accordance with the
invention applies to both integers and floating-point numbers. If a constant is defined as
numerical then a continuous range of possible constant values must be specified by a
user in advance. For example, if the condition was "minutes_past_the_hour(X)" then X
could have a range 0-59.

In an integer constant search, if a range or interval for a particular constant is less
than 50 in length, all integers (points) in the range are considered. For each of these
integers, a new version of a respective associated refinement is created in which the
relevant integer is substituted in place of a corresponding variable and new rules are
evaluated and given an accuracy score using an appropriate accuracy estimation
procedure. The constant(s) giving a best accuracy score is (are) recorded.

If the integer interval length is greater than 50, then a recursive process is carried out as
follows:

1. A proportion of the points (which are evenly spaced) in the interval length are
sampled to derive an initial set of constant values. For example, in the
"minutes_past_the_hour(X)" example, 10, 20, 30, 40 and 50 minutes might be
sampled. For each of these values, a new version of a respective refinement is
created in which the value is substituted in place of a corresponding variable and a
respective rule is evaluated for each value together with an associated accuracy
estimate.

2. a. If a single constant value provides the best score then a number of the values (the
number of which is a user-selected parameter in the ILP engine 12) either side of this
value are sampled. For instance, if the condition minutes_past_the_hour(20) gave
the best accuracy then the following more precise conditions may then be evaluated:

• minutes_past_the_hour(15)
• minutes_past_the_hour(16)
• minutes_past_the_hour(17)
• minutes_past_the_hour(18)
• minutes_past_the_hour(19)
• minutes_past_the_hour(21)
• minutes_past_the_hour(22)
• minutes_past_the_hour(23)
• minutes_past_the_hour(24)
• minutes_past_the_hour(25)

If a single constant value in X = 15 to 25 gives the best accuracy score then that value is
chosen as a final value of the constant X.

2. b. If more than one constant value provides the best score then if they are
consecutive points in the sampling then the highest and lowest values are taken and
the values in their surrounding intervals are tested. For example, if
minutes_past_the_hour(20), minutes_past_the_hour(30) and
minutes_past_the_hour(40) all returned the same accuracy then the following points
would be tested for accuracy:

• minutes_past_the_hour(15)
• minutes_past_the_hour(16)
• minutes_past_the_hour(17)
• minutes_past_the_hour(18)
• minutes_past_the_hour(19)
• minutes_past_the_hour(41)
• minutes_past_the_hour(42)
• minutes_past_the_hour(43)
• minutes_past_the_hour(44)
• minutes_past_the_hour(45)

If the accuracy score decreases at an integer value N in the range 15 to 19 or 41 to 45,
then (N-1) is taken as the constant in the refinement of the relevant rule.

2. c. If a plurality of constant values provides the best accuracy score, and the values
are not consecutive sampled points then they are arranged into respective subsets of
consecutive points. The largest of these subsets is selected, and the procedure for a
list of consecutive points is followed as at 2. b. above: e.g. if
minutes_past_the_hour(20), minutes_past_the_hour(30) and
minutes_past_the_hour(50) scored best then the subset minutes_past_the_hour(20)
to minutes_past_the_hour(30) would be chosen. If the largest subset consists of only
one value, then the procedure for a single returned value is followed as at 2. a. above.
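
The exhaustive search for short intervals and the single-best-value case of the coarse-to-fine search (steps 1 and 2. a. above) might be sketched in Python as follows. The function name, parameter names and default values are assumptions for illustration; `score` stands for whichever accuracy estimate is in use. The tie handling of steps 2. b. and 2. c. is deliberately not reproduced: where several values tie, this simplified sketch just takes the lowest tied value.

```python
def coarse_to_fine_constant(score, low, high, step=10, neighbours=5, exhaustive_below=50):
    """Pick an integer constant in [low, high] with the best score.

    Short intervals are searched exhaustively. Longer ones are first sampled
    at evenly spaced points, then every integer within `neighbours` of the
    best coarse sample is scored and the best of those is returned.
    """
    if high - low < exhaustive_below:
        return max(range(low, high + 1), key=score)
    coarse_samples = range(low + step, high + 1, step)  # e.g. 10, 20, 30, 40, 50
    best_coarse = max(coarse_samples, key=score)
    fine_low = max(low, best_coarse - neighbours)
    fine_high = min(high, best_coarse + neighbours)
    return max(range(fine_low, fine_high + 1), key=score)
```

For the range 0-59 with a score peaking near 20, the second stage examines the values 15 to 25, matching the example given above. The saving over an exhaustive search comes at the cost of possibly missing a narrow peak lying between coarse samples.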

The user can opt to conduct a beam constant search: here a beam is an expression
describing generating a number of possible refinements to a rule and recording all of
them to enable a choice to be made between them later when subsequent refinements
have been generated. In this example, N refinements of a rule, each with a different
constant value, are recorded. This can be very effective, as the 'best' constant with
highest accuracy at one point in the refinement process 32 may not turn out to be the
'best' value over a series of repeated refinement iterations. This avoids the process 32
getting fixed in local non-optimum maxima.
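
Keeping the N best candidates rather than a single one can be sketched in a few lines of Python using the standard library `heapq.nlargest`. The function name `select_beam` and its parameters are hypothetical; `score` again stands for the accuracy estimate in use.

```python
import heapq


def select_beam(refinements, score, beam_width):
    """Keep the `beam_width` refinements with the highest scores, best first."""
    return heapq.nlargest(beam_width, refinements, key=score)
```

With a beam width of 1 this reduces to a greedy search that keeps only the current best constant.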

Some variables in conditions/rules may be associated with multiple constants: if so, each
constant associated with such a variable is treated as an individual constant, and a
respective best value for each is found separately as described above. An individual
constant value that obtains a highest accuracy score for the relevant rule is kept and the
corresponding variable is instantiated to that value. The remaining variables of constant
type are instantiated by following this process recursively until all constant type variables
have been instantiated (i.e. substituted by values).

Once all refinements of a rule have been found, in accordance with the invention, the
refinements are filtered at 34 to remove any rules that are duplicates or equivalents of
others in the set. Two rules are equivalent in that they express the same concept if their
conditions in the IF {set of conditions} part of the rule are the same but the conditions are
ordered differently. For example, IF {set of conditions} consisting of two conditions A
and B is equivalent to IF {set of conditions} with the same two conditions in a different
order, i.e. B and A. One of the two equivalent rules is removed from the list of
refinements and so is not considered further during rule refinement, which reduces the
processing burden.

Additionally, in accordance with the invention, symmetric conditions are not allowed in
any rule. For example, a condition equal(X,2), meaning a variable X is equal in value to 2,
is symmetric to equal(2,X), i.e. 2 is equal in value to a variable X. One of the two
symmetric rules is removed from the list of refinements and so is not considered further.

Pruning refinements to remove equivalent rules and symmetric conditions results in
fewer rules to consider at successive iterations of the refinement process 32, so the
whole rule generation process is speeded up. Such pruning can reduce rule search
space considerably, albeit the extent of this reduction depends on what application is
envisaged for the invention and how many possible conditions are symmetric: in this
connection, where numerical variables are involved, symmetric conditions are usually
numerous due to the use of 'equals' conditions such as equal(Y,X). For example, in the
retail fraud example, the rule search space can be cut by up to a third.
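
One way to realise the duplicate, equivalence and symmetry filtering is to map each rule to a canonical form and keep only the first rule seen for each form. The Python sketch below assumes, for illustration only, that a condition is a tuple such as ("equal", "X", 2) and that the set of symmetric predicate names is supplied by the user.

```python
SYMMETRIC_PREDICATES = {"equal"}  # assumed: predicates whose argument order is immaterial


def canonical_condition(condition):
    """Put the arguments of a symmetric condition into a fixed order."""
    name, *args = condition
    if name in SYMMETRIC_PREDICATES:
        args = sorted(args, key=repr)  # repr allows mixed variables and numbers
    return (name, *args)


def canonical_rule(rule):
    """Ignore condition order by treating the condition set as a frozenset."""
    return frozenset(canonical_condition(c) for c in rule)


def filter_refinements(refinements):
    """Keep the first rule of each group of duplicate, equivalent or symmetric rules."""
    seen = set()
    kept = []
    for rule in refinements:
        key = canonical_rule(rule)
        if key not in seen:
            seen.add(key)
            kept.append(rule)
    return kept
```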

A 'less than or equals' condition, referred to as 'lteq', and an 'equals' condition are often
used as part of the background knowledge 14. They are very useful conditions for
comparing numerical variables within the data. For this reason, part of the filtering
process 34 ascertains that equals and lteq conditions in rules meet checking
requirements as follows:

• End of interval check: this checks the end of intervals where constant values are
involved: e.g. a condition lteq(A, 1000) means variable A is less than or equal to
1000. It is unnecessary if A has a user-defined range of between 0 and 1000, so
a refinement containing this condition is removed. In addition, lteq(1000, A), i.e. 1000
is less than or equal to A, should be equals(A, 1000) as A cannot be more than
1000. Therefore, refinements containing such conditions are rejected.

• Multiple 'lteq' predicate check: if two conditions lteq(A,X) and lteq(B,X), where A
and B are constants, are contained in the body of a rule, then one condition may
be removed depending on the values of A and B. For example, if lteq(30,X) and
lteq(40,X) both appear in a rule, then the condition lteq(30,X) is removed from the
rule as being redundant, because if 40 is less than or equal to X then so also is
30.

• Equals and lteq duplication check: in accordance with the invention, if the body of
a rule contains both conditions lteq(C, Constant) and equals(C, Constant), then
only the equals condition is needed. Therefore, refinements containing lteq
conditions with associated equals conditions of this nature are rejected.
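
The three checks above might be sketched in Python as follows. The representation is an assumption for illustration: a condition is a three-element tuple such as ("lteq", "A", 1000), variables are strings, constants are numbers, and `ranges` maps a variable name to its user-defined (lower, upper) range.

```python
def is_var(term):
    return isinstance(term, str)


def fails_end_of_interval(rule, ranges):
    """True if an lteq condition compares a variable with the top of its range."""
    for name, left, right in rule:
        if name != "lteq":
            continue
        # lteq(A, 1000) with A in 0..1000 is always true, hence unnecessary.
        if is_var(left) and not is_var(right) and left in ranges and right >= ranges[left][1]:
            return True
        # lteq(1000, A) with A in 0..1000 should have been equals(A, 1000).
        if is_var(right) and not is_var(left) and right in ranges and left >= ranges[right][1]:
            return True
    return False


def drop_redundant_lower_bounds(rule):
    """Of several lteq(constant, X) conditions on X, keep only the largest constant."""
    strongest = {}
    for name, left, right in rule:
        if name == "lteq" and not is_var(left) and is_var(right):
            strongest[right] = max(left, strongest.get(right, left))
    return [
        c for c in rule
        if not (c[0] == "lteq" and not is_var(c[1]) and is_var(c[2]) and c[1] < strongest[c[2]])
    ]


def has_equals_lteq_duplicate(rule):
    """True if the rule holds both lteq(C, K) and equals(C, K) for the same C and K."""
    conditions = set(rule)
    return any(
        ("equals", left, right) in conditions
        for name, left, right in rule
        if name == "lteq"
    )
```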

Rule refinements are also filtered at 34 using a method called 'Encoding Length
Restriction' disclosed by N. Lavrac and S. Dzeroski, Inductive Logic Programming:
Techniques and Applications, Ellis Horwood, New York, 1994. It is based on a 'Minimum
Description Length' principle disclosed by B. Pfahringer, Practical Uses of the Minimum
Description Length Principle in Inductive Learning, PhD Thesis, Technical University of
Vienna, 1995.

Where training examples are noisy (i.e. contain incorrect or missing values), it is
desirable to ensure that rules generated using the invention do not overfit data by
treating noise present in the data as requiring fitting. Rule sets that overfit training data
may include some very specific rules that only cover a few training data samples. In
noisy domains, it is likely that these few samples will be noisy: noisy data samples are
unlikely to indicate transactions which are truly representative of fraud, and so rules
should not be derived to cover them.

The Encoding Length Restriction avoids overfitting noisy data by generating a rule
refinement only as long as the cost of encoding the refinement will not exceed the cost of
encoding the positive examples covered by the refinement, where 'cost' means number of
bits. A refinement is rejected if this cost criterion is not met. This prevents rules
becoming too specific, i.e. covering few but potentially noisy transactions.

Once a rule is refined, the resulting refinements are evaluated in order to identify those
which are best. Rules are evaluated at 36 by estimating their classification accuracy.
This accuracy may be estimated using an expected classification accuracy estimate
technique disclosed by N. Lavrac and S. Dzeroski, Inductive Logic Programming:
Techniques and Applications, Ellis Horwood, New York, 1994, and by F. Zelezny and N.
Lavrac, An Analysis of Heuristic Rule Evaluation Measures, J. Stefan Institute Technical
Report, March 1999. Alternatively, it may be estimated using a weighted relative
accuracy estimate disclosed by N. Lavrac, P. Flach and B. Zupan, Rule Evaluation
Measures: A Unifying View, Proceedings of the 9th International Workshop on Inductive
Logic Programming (ILP-99), volume 1634 of Lecture Notes in Artificial Intelligence,
pages 174-185, Springer-Verlag, June 1999. A user may decide which estimating
technique is used to guide a rule search through a hypothesis space during rule
generation.
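
For illustration, weighted relative accuracy is commonly written as p(Cond) x (p(Class|Cond) - p(Class)), and the Python sketch below computes it from coverage counts. The function and argument names are assumptions, and the probabilities are estimated here as plain relative frequencies; the embodiment may estimate them differently.

```python
def weighted_relative_accuracy(p, n, P, N):
    """Weighted relative accuracy of a rule from coverage counts.

    p, n: positive and negative examples covered by the rule
    P, N: total positive and negative examples in the training data set
    """
    total = P + N
    covered = p + n
    if covered == 0:
        return 0.0  # a rule that covers nothing carries no information
    # p(Cond) * (p(Class | Cond) - p(Class))
    return (covered / total) * (p / covered - P / total)
```

On this measure the most general rule scores zero, since its accuracy equals the prior proportion of positive examples, so a useful refinement must score above zero.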

Once refinements have been evaluated in terms of accuracy they are then tested for
what is referred to in the art of rule generation as 'significance'. In this example a
significance testing method is used which is based on a likelihood ratio statistic disclosed
in the N. Lavrac and S. Dzeroski reference above. A rule is defined as 'significant' if its
likelihood ratio statistic value is greater than a predefined threshold set by the user.

If a rule covers n positive examples and m negative examples, an optimum outcome of
refining the rule is that one of its refinements (an optimum refinement) will cover n
positive examples and no negative examples. A likelihood ratio for this optimum
refinement can be calculated. A rule is defined as 'possibly significant' if its optimum
refinement is significant. Note that it is possible that a rule may not actually be
significant, but it may be possibly significant in accordance with this definition.
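
The two definitions can be illustrated with the Python sketch below. The text cites the statistic by reference only, so the formula used here is an assumption: it is the form widely used in rule learners, 2 x sum of observed x ln(observed / expected) over the two classes, where the expected counts follow the class distribution of the whole training data set. The function names are hypothetical.

```python
import math


def likelihood_ratio_statistic(p, n, P, N):
    """Likelihood ratio statistic of a rule covering p positives and n negatives.

    P and N are the total numbers of positive and negative training examples.
    """
    covered = p + n
    total = P + N
    statistic = 0.0
    for observed, class_total in ((p, P), (n, N)):
        if observed > 0:  # a class with no covered examples contributes nothing
            expected = covered * class_total / total
            statistic += observed * math.log(observed / expected)
    return 2.0 * statistic


def is_significant(p, n, P, N, threshold):
    return likelihood_ratio_statistic(p, n, P, N) > threshold


def is_possibly_significant(p, n, P, N, threshold):
    """Test the optimum refinement: same positives covered, no negatives."""
    return is_significant(p, 0, P, N, threshold)
```

For example, with 20 positive and 80 negative training examples, a rule covering 1 positive and 4 negatives has a statistic of zero and so is not significant, yet its optimum refinement has a statistic of 2 ln 5 (about 3.2), so the rule is possibly significant for any threshold below that value.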

A first rule under consideration in the process 12 is checked at 38 to see whether or not
it meets rule construction stopping criteria: in this connection, the construction of an
individual rule terminates when any one or more of three stopping criteria is fulfilled as
follows:

1. the number of conditions in any rule in a beam (as defined earlier) currently being
processed is greater than or equal to a maximum rule length specified by the
user. If a most significant rule (see at 2. below) exists this is added to the
accumulating rule set at 40,

2. a most significant rule covers no negative examples - where the most significant
rule is defined as a rule that is either present in the current beam, or was present
in a previous beam, and this rule:

a) is significant,

b) obtained the highest likelihood ratio statistic value found so far, and

c) obtained an accuracy value greater than the accuracy value of the most
general rule (that covers all examples, both positive and negative), and

3. the previous refinement step 32 produced no refinements eligible to enter the
new beam; if a most significant rule exists it is added to the accumulating rule set
at 40.



Note that a most significant rule may not necessarily exist; if it does not, no significant
refinements have been found so far. If it is the case that a most significant rule does not
exist but the stopping criteria at 38 are satisfied, then no rule is added to the rule set at 40
and the stopping criteria at 44 will be satisfied (as will be described later).
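
The check at 38 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of the embodiment: the `Rule` class, its fields and the function name are hypothetical, and the most significant rule is assumed to be tracked by the caller in accordance with conditions a) to c) above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    conditions: List[str]   # hypothetical representation of a rule's condition set
    neg_covered: int        # number of negative examples the rule covers


def rule_construction_stops(beam, most_significant, eligible_refinements, max_rule_length):
    """Return (stop, rule_to_add) for one check of the three stopping criteria."""
    # Criterion 1: some rule in the current beam has reached the maximum length.
    too_long = any(len(rule.conditions) >= max_rule_length for rule in beam)
    # Criterion 2: a most significant rule exists and covers no negative examples.
    pure = most_significant is not None and most_significant.neg_covered == 0
    # Criterion 3: the last refinement step produced nothing eligible for the new beam.
    exhausted = len(eligible_refinements) == 0
    stop = too_long or pure or exhausted
    # On stopping, the most significant rule (if any) is the one added at 40.
    return stop, (most_significant if stop else None)
```

When the function returns `(True, None)`, construction has stopped without a most significant rule, which corresponds to the case in which no rule is added at 40.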

When a rule is added at 40, the positive examples it covers are removed from the
training data at 42, and remaining or unremoved positive and negative examples form a
modified training data set for a subsequent iteration (if any) of the rule search.

At 44 a check is made to see whether or not the accumulating rule set satisfies stopping
criteria. In this connection, accumulation of the rule set terminates at 46 (finalising the
rule set) when either of the following criteria is fulfilled, that is to say when either:

• construction of a rule is terminated because a most significant rule does not exist,
or

• too few positive examples remain for further rules to be significant.

If at 44 the accumulating rule set does not satisfy the rule set stopping criteria, another
most general rule is selected at 30 and accumulation of the rule set iterates through
stages 32 etc. At any given time in operation of the rule generation process 12, there are
a number (zero or more) of rules for which processing has terminated and which have been
added to the accumulating rule set, and there are (one or more) evolving rules or
proto-rules for which processing to yield refinements continues iteratively.
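
The accumulation of the rule set (stages 30 to 46) follows a covering pattern, which may be outlined in Python as below. The sketch rests on assumptions: `construct_rule` is a hypothetical callback standing in for the whole rule construction loop, and `min_positives` is a hypothetical parameter standing in for the test that too few positive examples remain for further rules to be significant.

```python
def accumulate_rule_set(positives, negatives, construct_rule, min_positives):
    """Covering loop in outline.

    construct_rule(positives, negatives) is a hypothetical callback returning
    (rule, covered_positives), or None when no most significant rule exists.
    """
    remaining = set(positives)
    rule_set = []
    # Stop when too few positives remain for another rule to be significant.
    while len(remaining) >= min_positives:
        result = construct_rule(remaining, negatives)
        if result is None:
            break  # no most significant rule exists: finalise the rule set
        rule, covered = result
        covered = set(covered) & remaining
        if not covered:
            break  # guard (assumption of this sketch): no progress was made
        rule_set.append(rule)   # rule added at 40
        remaining -= covered    # covered positives removed at 42
    return rule_set, remaining
```

The negative examples are passed unchanged to every iteration, reflecting that only covered positive examples are removed from the training data.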



If evolving rules are checked at 38 and are found not to meet any of the rule construction
stopping criteria previously mentioned, those refinements of such rules are chosen which
have the best accuracy estimate scores. The chosen refinements then provide a basis
for a next generation of rules to be refined further in subsequent refinement iterations.
The user defines the number of refinements forming a new beam to be taken to a further
iteration by fixing a parameter called 'beam_width'. As has been said, a beam is a
number of recorded possible refinements to a rule from which a choice will be made
later, and beam_width is the number of refinements in it. For a beam width N, the
refinements having the best N accuracy estimate scores are found and taken forward at
48 as part of the new beam to the next iteration. The sequence of stages 32 to 38 then
iterates for this new beam via a loop 50.

Each refinement entering the new beam must 

• be possibly significant (but not necessarily significant), and

• improve upon or equal the accuracy of its parent rule (the rule from which it was 
derived by refinement previously). 
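
Selection of the new beam at 48 can be sketched in Python as follows. The dictionary keys `accuracy`, `parent_accuracy` and `possibly_significant` are hypothetical names introduced for illustration; the sketch assumes that the accuracy estimates and the possible-significance test have already been evaluated for each refinement.

```python
def select_beam(refinements, beam_width):
    """Pick the refinements that go forward to the next iteration.

    refinements: list of dicts with hypothetical keys 'accuracy',
    'parent_accuracy' and 'possibly_significant'.
    """
    eligible = [
        r for r in refinements
        # Must be possibly significant and at least as accurate as its parent.
        if r["possibly_significant"] and r["accuracy"] >= r["parent_accuracy"]
    ]
    # Keep the beam_width refinements with the best accuracy estimates.
    eligible.sort(key=lambda r: r["accuracy"], reverse=True)
    return eligible[:beam_width]
```

If fewer than `beam_width` refinements are eligible, the new beam is simply smaller; an empty result corresponds to the third rule construction stopping criterion.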

If required by the user, the accumulated rule set can be post-pruned using a reduced
error pruning method disclosed by J. Furnkranz, "A Comparison of Pruning Methods for
Relational Concept Learning", Proceedings of AAAI'94 Workshop on Knowledge
Discovery in Databases (KDD-94), Seattle, WA, 1994. In this case, another set of
examples should be provided - a pruning set of examples.

Examples of a small training data set, background knowledge and a rule set generated
therefrom will now be given. In practice there may be very large numbers of data samples
in a data set.

Training data:

The training data is a transaction database, represented as Prolog facts in a format
as follows:

trans(Trans ID, Date, Time, Cashier, Expected amount in till, Actual amount in
till, Suspicious Flag).

Here 'trans' and 'Trans' mean transaction and ID means identity.

A sample of an example set of transaction data is shown below. Transactions
with Suspicious Flag = 1 are fraudulent (positive examples), and with
Suspicious Flag = 0 are not (negative examples). The individual Prolog facts
were:

trans(1, 30/08/2003, 09:02, cashier_1, 121.87, 123.96, 0).
trans(2, 30/08/2003, 08:56, cashier_1, 119.38, 121.82, 0).
trans(3, 30/08/2003, 08:50, cashier_1, 118.59, 119.38, 0).
trans(4, 30/08/2003, 08:48, cashier_1, 116.50, 118.59, 0).
trans(5, 30/08/2003, 08:44, cashier_1, 115.71, 116.50, 0).
trans(6, 30/08/2003, 22:40, cashier_2, 431.68, 435.17, 0).
trans(7, 30/08/2003, 22:37, cashier_2, 423.70, 431.68, 1).
trans(8, 30/08/2003, 22:35, cashier_2, 420.01, 423.70, 0).

Background knowledge: 

Examples of appropriate background knowledge concepts, represented using
Prolog, are:

discrepancy(Trans_ID, Discrepancy).

This gives the discrepancy in UK £ and pence between the expected amount of
cash in a till and the actual amount of cash in that till for a particular transaction,
e.g.:

discrepancy(1, 2.09).
discrepancy(2, 2.44).
discrepancy(7, 7.98).

total_trans(Cashier, Total number of transactions, Month and Year).

This gives the total number of transactions made by the cashier in a given
month and year, e.g.:

total_trans(cashier_1, 455, 08/2003).
total_trans(cashier_2, 345, 08/2003).

number_of_trans_with_discrepancy(Cashier, Number, Month and Year).

This gives the total number of transactions with a discrepancy made by a
cashier in a given month and year, e.g.:

number_of_trans_with_discrepancy(cashier_1, 38, 08/2003).
number_of_trans_with_discrepancy(cashier_2, 93, 08/2003).

number_of_trans_with_discrepancy_greater_than(Cashier, Number, Bound,
Month and Year).

This gives the total number of transactions with a discrepancy greater than some
bound made by a cashier in a given month and year, e.g.:

number_of_trans_with_discrepancy_greater_than(cashier_1, 5, 100, 08/2003).
number_of_trans_with_discrepancy_greater_than(…, 3, 150, 08/2003).
number_of_trans_with_discrepancy_greater_than(cashier_2, 15, 100, 08/2003).
number_of_trans_with_discrepancy_greater_than(…, …, …, 08/2003).
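
By way of illustration, background knowledge of this kind can be derived from transaction records, as in the following Python sketch. The tuple layout and function names are assumptions introduced here, the discrepancy is assumed to be unsigned, and grouping by month and year is omitted for brevity; the four records are taken from the sample training data, which covers a single day and so does not reproduce the monthly counts quoted in the examples.

```python
from collections import Counter
from decimal import Decimal

# (trans_id, cashier, expected_in_till, actual_in_till), from the sample data.
TRANSACTIONS = [
    (1, "cashier_1", "121.87", "123.96"),
    (2, "cashier_1", "119.38", "121.82"),
    (6, "cashier_2", "431.68", "435.17"),
    (7, "cashier_2", "423.70", "431.68"),
]


def discrepancy(expected, actual):
    # Assumption: the discrepancy is the absolute difference in pounds and pence.
    return abs(Decimal(actual) - Decimal(expected))


def count_greater_than(transactions, bound):
    """Per-cashier count of transactions whose discrepancy exceeds bound."""
    counts = Counter()
    for _trans_id, cashier, expected, actual in transactions:
        if discrepancy(expected, actual) > Decimal(bound):
            counts[cashier] += 1
    return counts
```

Decimal arithmetic is used so that monetary amounts such as 2.09 are reproduced exactly rather than as binary floating point approximations.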

Generated rule set:

The target concept is fraudulent(Cashier). The rule set characterises a cashier who
has made fraudulent transactions.

fraudulent(Cashier) :-
    number_of_trans_with_discrepancy_greater_than(Cashier, Discrepancies,
        100, Month),
    Discrepancies >= 10.

fraudulent(Cashier) :-
    total_trans(Cashier, Total_Trans, Month),
    Total_Trans >= 455,
    number_of_trans_with_discrepancy(Cashier, Discrepancies, Month),
    Discrepancies >= 230.

This example of a generated rule set characterises fraudulent cashiers using two rules.
The first rule indicates that a cashier is fraudulent if, in a single month, the cashier
has performed at least 10 transactions with a discrepancy greater than 100.
The second rule describes a cashier as fraudulent if, in a single month, the cashier has
carried out at least 455 transactions, where at least 230 of these have had a discrepancy
between the expected amount and the actual transaction amount.
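
Application of these two rules to the monthly figures given in the background knowledge examples can be sketched in Python as follows. The dictionary layout and function names are hypothetical; the figures for 08/2003 are those quoted above, and the thresholds follow the prose description of the rules ("at least").

```python
# Monthly statistics taken from the background knowledge examples (08/2003).
STATS = {
    "cashier_1": {"total_trans": 455, "with_discrepancy": 38, "over_100": 5},
    "cashier_2": {"total_trans": 345, "with_discrepancy": 93, "over_100": 15},
}


def rule_1(s):
    # At least 10 transactions with a discrepancy greater than 100.
    return s["over_100"] >= 10


def rule_2(s):
    # At least 455 transactions, at least 230 of them with a discrepancy.
    return s["total_trans"] >= 455 and s["with_discrepancy"] >= 230


def fraudulent(s):
    # A cashier is flagged if any rule in the rule set is satisfied.
    return rule_1(s) or rule_2(s)


flagged = sorted(name for name, s in STATS.items() if fraudulent(s))
```

With these figures only cashier_2 is flagged, by the first rule (15 transactions with a discrepancy greater than 100); cashier_1 meets the transaction count of the second rule but has only 38 transactions with a discrepancy.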

The embodiment of the invention described above provides the following benefits: 

• It is fast because it prunes out duplicate rules, avoiding unnecessary processing;

• It can deal with and tune numerical and non-numerical constants to derive rules 
that bound variables (e.g. IF transaction value is between £19.45 and £67.89 
THEN ...); 

• It can make use of many different heuristics (decision techniques e.g. based on
scores for accuracy), which can be changed and turned on or off by a user; 

• It uses a weighted relative accuracy measure in rule generation; 

• It develops rules that are readable and its reasoning can be understood (unlike 
a neural network for example); 

• It can be tuned to a particular application by adjusting its parameters and 
changing/adding heuristics; 

• It can use relational and structural data that can be expressed in Prolog;

• It can process numerical and non-numerical data; and 

• It can make use of expert knowledge encoded in Prolog.

The process undertaken by the ILP engine at 12 as set out in the foregoing description
can clearly be evaluated by an appropriate computer program comprising program
instructions embodied in an appropriate carrier medium and running on a conventional
computer system. The computer program may be embodied in a memory, a floppy or
compact or optical disc or other hardware recordal medium, or an electrical signal. Such
a program is straightforward for a skilled programmer to implement on the basis of the
foregoing description without requiring invention, because it involves well known
computational procedures.
Claims

1. A method of anomaly detection characterised in that it incorporates the steps of:-

a) developing a rule set of at least one anomaly characterisation rule from a
training data set and any available relevant background knowledge, a rule
covering a proportion of positive anomaly examples of data in the training data
set, and

b) applying the rule set to test data for anomaly detection therein.

2. A method according to Claim 1 characterised in that data samples in the training data
set have characters indicating whether or not they are associated with anomalies.



3. A method according to Claim 2 characterised in that it is a method of detecting
telecommunications or retail fraud from anomalous data.

4. A method according to Claim 3 characterised in that it employs inductive logic
programming to develop the rule set.

5. A method according to Claim 4 characterised in that each rule has a form that an
anomaly is detected or otherwise by application of the rule according to whether or not
a condition set of at least one condition associated with the rule is fulfilled.

6. A method according to Claim 5 characterised in that each rule is developed by refining 
a most general rule by at least one of: 

a) addition of a new condition to the condition set; 

b) unification of different variables to become constants or structured terms in 
condition. 

7. A method according to Claim 6 characterised in that a variable in a rule which is
defined as being in constant mode and is numerical is at least partly evaluated by
providing a range of values for the variable, estimating an accuracy for each value and
selecting a value having optimum accuracy.

8. A method according to Claim 7 characterised in that the range of values is a first
range with values which are relatively widely spaced, a single optimum accuracy value
is obtained for the variable, and the method includes selecting a second and relatively
narrowly spaced range of values in the optimum accuracy value's vicinity, estimating
an accuracy for each value in the second range and selecting a value in the second
range having optimum accuracy.

9. A method according to Claim 4 characterised in that it includes filtering to remove
duplicates of rules and equivalents of rules, i.e. rules having like but differently
ordered conditions compared to another rule, and rules which have conditions which
are symmetric compared to those of another rule.

10. A method according to Claim 4 or 9 characterised in that it includes filtering to remove
unnecessary 'less than or equal to' ("lteq") conditions.

11. A method according to Claim 10 characterised in that the unnecessary "lteq"
conditions are associated with at least one of ends of intervals, multiple lteq
predicates and equality condition and lteq duplication.

12. A method according to Claim 4 characterised in that it includes implementing an 
encoding length restriction to avoid overfitting noisy data by rejecting a rule refinement 
if the refinement encoding cost in number of bits exceeds a cost of encoding the
positive examples covered by the refinement.

13. A method according to Claim 4 characterised in that it includes stopping construction
of a rule if at least one of three stopping criteria is fulfilled as follows:

a) the number of conditions in any rule in a beam of rules being processed is
greater than or equal to a prearranged maximum rule length,

b) no negative examples are covered by a most significant rule, which is a rule
that

i) is present in a beam currently being or having been processed,

ii) is significant,

iii) has obtained a highest likelihood ratio statistic value found so far, and

iv) has obtained an accuracy value greater than a most general rule
accuracy value, and

c) no refinements were produced which were eligible to enter the beam currently
being processed in a most recent refinement processing step (32).

14. A method according to Claim 13 characterised in that it includes adding the most
significant rule to a list of derived rules and removing positive examples covered by
the most significant rule from the training data set.

15. A method according to Claim 4 characterised in that it includes:

a) selecting rules which have not met rule construction stopping criteria, 

b) selecting a subset of refinements of the selected rules associated with
accuracy estimate scores higher than those of other refinements of the
selected rules, and

c) iterating a rule refinement, filtering and evaluation procedure (32 to 38) to
identify any refined rule usable to test data.

16. Computer apparatus for anomaly detection characterised in that it is programmed to
execute the steps of

a) developing a rule set of at least one anomaly characterisation rule from a
training data set and any available relevant background knowledge, a rule 
covering a proportion of positive anomaly examples of data in the training data 
set, and 

b) applying the rule set to test data for anomaly detection therein. 

17. Computer software for use in anomaly detection characterised in that it incorporates 
instructions for controlling computer apparatus to execute the steps of:- 

a) developing a rule set of at least one anomaly characterisation rule from a 
training data set and any available relevant background knowledge, a rule 
covering a proportion of positive anomaly examples of data in the training data 
set, and 

b) applying the rule set to test data for anomaly detection therein. 
ABSTRACT

A method of anomaly detection applicable to telecommunications or retail fraud uses
inductive logic programming to develop anomaly characterisation rules from relevant
background knowledge and a training data set, which includes positive anomaly samples of
data covered by rules. Data samples include 1 or 0 indicating association or otherwise with
anomalies. An anomaly is detected by a rule having a condition set which the anomaly fulfils.
Rules are developed by addition of conditions and unification of variables, and are filtered to
remove duplicates, equivalents, symmetric rules and unnecessary conditions. Overfitting of
noisy data is avoided by an encoding cost criterion. Termination of rule construction involves
criteria of rule length, absence of negative examples, rule significance and accuracy, and
absence of recent refinement. Iteration of rule construction involves selecting rules with
unterminated construction, selecting rule refinements associated with high accuracies and
iterating a rule refinement, filtering and evaluation procedure (32 to 38) to identify any
refined rule usable to test data.